First, load the necessary libraries. In addition to the rpack package, we use the tidyverse (dplyr for data manipulation and ggplot2 for plotting) and flexmix for mixture modelling.
#install.packages("lcmix", repos=c("http://R-Forge.R-project.org",
# "http://cran.at.r-project.org"),dependencies=TRUE)
library(rpack)
library(tidyverse)
library(LaplacesDemon)
library(Matrix)
library(plotly)
library(Gmedian)
library(lcmix) # Added by Markku. Package can be found at R-Forge
library(parallel) # For parallel computing. rpack uses function "detectCores".
library(flexmix)
These were added by Markku. These external functions generate data from a gamma distribution. Point weights are determined either 1) at random or 2) by the distance from the distribution mean: the greater the Euclidean distance between a point and the population mean, the larger the weight. “simulate_unif_grid” is a sub-function of the main function “simulate_gamma_mixture”.
source("CodeCollection/simulate_gamma_mixture.R")
source("CodeCollection/simulate_unif_grid.R")
source("CodeCollection/utility_functions.R")
# Get the colors for plotting
c_col <- get_cols()
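As an aside, the distance-based weighting scheme described above can be sketched in a few lines of base R. This is a hypothetical illustration only; the actual implementation lives in `simulate_gamma_mixture.R`.

```r
# Hypothetical sketch of the distance-based weighting: points farther from
# the sample mean receive proportionally larger weights.
set.seed(1)
pts <- matrix(rnorm(200), ncol = 2)         # 100 points in 2D
ctr <- colMeans(pts)                        # stand-in for the population mean
d   <- sqrt(rowSums(sweep(pts, 2, ctr)^2))  # Euclidean distance from the mean
w   <- d / sum(d)                           # normalized weights
```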
For testing purposes, we first generate a small and simple data set to test the flexmix package.
# Simple data set
m1 <- 0
m2 <- 50
sd1 <- sd2 <- 5
N1 <- 100
N2 <- 10
a <- rnorm(n=N1, mean=m1, sd=sd1)
b <- rnorm(n=N2, mean=m2, sd=sd2)
x <- c(a,b)
class <- c(rep('a', N1), rep('b', N2))
data <- data.frame(cbind(x=as.numeric(x), class=as.factor(class)))
# Note: cbind() builds a numeric matrix, so the factor is coerced to its
# integer codes; 'class' is therefore stored as 1 ('a') and 2 ('b') below.
Plot the data, then fit a mixture of two Gaussians to it.
library("ggplot2")
p <- ggplot(data, aes(x = x)) +
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", fill = "white") +
  geom_vline(xintercept = m1, col = "red", size = 2) +
  geom_vline(xintercept = m2, col = "blue", size = 2)
p
set.seed(0)
mo1 <- FLXMRglm(family = "gaussian")
mo2 <- FLXMRglm(family = "gaussian")
flexfit <- flexmix(x ~ 1, data = data, k = 2, model = list(mo1, mo2))
# So how did we do? It looks like we got class assignments perfectly
print(table(clusters(flexfit), data$class))
##
## 1 2
## 1 0 10
## 2 100 0
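Because mixture-component labels are arbitrary, "perfect" here means perfect up to label switching. A small base-R check can score the assignment by taking the best one-to-one label matching; the matrix `tab` below simply restates the counts printed above.

```r
# Confusion table from above: rows = fitted clusters, columns = true classes.
tab <- matrix(c(0, 100, 10, 0), nrow = 2)
# With two clusters there are only two label matchings; take the better one.
acc <- max(sum(diag(tab)), sum(diag(tab[2:1, ]))) / sum(tab)
acc  # 1: every point is assigned correctly up to relabelling
```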
Lastly, let’s visualize the fitted model along with the data.
c1 <- parameters(flexfit, component=1)[[1]]
c2 <- parameters(flexfit, component=2)[[1]]
#' Source: http://tinyheero.github.io/2015/10/13/mixture-model.html
#' Plot a Mixture Component
#'
#' @param x Input data
#' @param mu Mean of component
#' @param sigma Standard deviation of component
#' @param lam Mixture weight of component
plot_mix_comps <- function(x, mu, sigma, lam) {
  lam * dnorm(x, mu, sigma)
}
lam <- table(clusters(flexfit))
ggplot(data) +
  geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", fill = "white") +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(c1[1], c1[2], lam[1] / sum(lam)),
                colour = "red", lwd = 1.5) +
  stat_function(geom = "line", fun = plot_mix_comps,
                args = list(c2[1], c2[2], lam[2] / sum(lam)),
                colour = "blue", lwd = 1.5) +
  ylab("Density")
All good here!
Let’s set up some data to be clustered. To control cluster overlap, the clusters are placed on a grid.
Outliers are sampled uniformly over the cluster grid.
Some of the simulated clusters have heavy-weight points at their edges (50/50 chance per cluster).
# Initialize seed number:
#seed = Sys.time()
#seed = as.integer(seed)
#seed = seed %% 100000
seed = 10376
set.seed(seed)
k = 10 # ten clusters
n = c(20, 40, 60, 80, 50, 80, 60, 40, 20, 50) # uneven cluster sizes
n_out = 20 # number of outliers
# Generating 500 points from mixture of 10 gamma distributions.
test_dat = simulate_gamma_mixture(
  n,
  k,
  n_out = n_out,
  out_scale = 5,
  scale_between_range = c(0, 1),
  outgroup_alpha = 0.4,
  place_on_grid = TRUE,
  overlap_scale = 0.5
)
true_mu <- test_dat$mu_true
test_dat <- test_dat$Y
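As a quick sanity check of the comment above (`n` and `n_out` are repeated here so the snippet stands alone): the cluster sizes do sum to 500 points, with the outliers coming on top of that.

```r
# Cluster sizes and outlier count as defined above.
n     <- c(20, 40, 60, 80, 50, 80, 60, 40, 20, 50)
n_out <- 20
sum(n)          # 500 points drawn from the gamma mixture
sum(n) + n_out  # 520, if the outliers are appended to the returned data
```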
# Ids in interactive plot
id <- 1:nrow(test_dat)
plot_sim <-
ggplot(data = test_dat, aes(
x = x,
y = y,
size = w,
label = id
)) +
geom_point() +
scale_size(range = c(2, 6)) + # Scale objects sizes
guides(color = guide_legend(# Point size in legend
override.aes = list(size = 5))) +
labs(x = "x", y = "y", title = "Unclustered data") +
theme(
legend.position = "right",
# Legend position and removing ticks from axis
axis.text.x = ggplot2::element_blank(),
axis.text.y = ggplot2::element_blank(),
axis.ticks = ggplot2::element_blank()
)
ggplotly(plot_sim, tooltip = c("id", "w"))
Fit a normal mixture model to the data we just simulated.
formula <- cbind(x, y) ~ 1
set.seed(1)
test_flexfit <- flexmix(
  formula = formula,
  data = test_dat,
  k = k,
  model = FLXMCmvnorm(diagonal = TRUE)) # 'diagonal' restricts components to diagonal covariances
print(table(clusters(test_flexfit), test_dat$orig_group))
##
## 1 2 3 4 5 6 7 8 9 10 11
## 1 0 0 0 0 0 55 0 0 0 0 2
## 2 1 20 0 0 45 1 60 0 8 2 1
## 3 19 0 0 0 0 22 0 0 0 0 0
## 4 0 20 0 77 1 0 0 0 0 47 2
## 5 0 0 2 3 3 1 0 1 12 1 12
## 6 0 0 58 0 0 1 0 0 0 0 3
## 7 0 0 0 0 1 0 0 39 0 0 0
Interestingly, only 7 of the 10 clusters were recovered: flexmix drops components whose estimated prior falls below the minprior control threshold (0.05 by default) during EM. Note that column 11 in the table above is the outlier group (its counts sum to the 20 outliers).
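If we want flexmix to keep small components alive, one option is to lower its `minprior` control parameter, which sets the minimum prior probability a component may have before it is removed during EM. The snippet below is a standalone toy illustration (not a re-fit of `test_dat` above); whether all requested components actually survive still depends on the data.

```r
library(flexmix)
# Standalone toy data: two well-separated bivariate normal clusters.
set.seed(1)
toy <- data.frame(x = c(rnorm(60, 0), rnorm(60, 8)),
                  y = c(rnorm(60, 0), rnorm(60, 8)))
# minprior = 0 prevents flexmix from dropping components whose prior
# shrinks during EM (the default threshold is 0.05).
fit <- flexmix(cbind(x, y) ~ 1, data = toy, k = 4,
               model = FLXMCmvnorm(diagonal = TRUE),
               control = list(minprior = 0))
length(prior(fit))  # number of components retained
```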